Transformers and GPT: a Technical Introduction
The original paper “Attention Is All You Need”
Transformers from scratch and a Hacker News Discussion
Harvard NLP’s The Annotated Transformer, a working Python implementation of the paper
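
For orientation alongside those implementations, here is a minimal NumPy sketch of the paper’s core operation, scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V. The function and the toy shapes are illustrative only and are not drawn from any of the linked codebases.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (batch, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # (batch, seq, d_v)

# Toy example: batch of 1, sequence length 4, model dimension 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((1, 4, 8))
K = rng.standard_normal((1, 4, 8))
V = rng.standard_normal((1, 4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (1, 4, 8)
```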
lucidrains/x-transformers: “A concise but fully-featured transformer, complete with a set of promising experimental features from various papers.” (Python)
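
That repository’s README documents a decoder-only setup roughly like the sketch below; the TransformerWrapper and Decoder entry points and their keyword arguments follow that README, but check the repo for the current API before relying on it.

```python
import torch
from x_transformers import TransformerWrapper, Decoder

# A small GPT-style decoder-only model, following the repository's README.
model = TransformerWrapper(
    num_tokens = 20000,      # vocabulary size
    max_seq_len = 1024,      # maximum context length
    attn_layers = Decoder(
        dim = 512,           # model width
        depth = 6,           # number of transformer blocks
        heads = 8            # attention heads per block
    )
)

tokens = torch.randint(0, 20000, (1, 1024))  # dummy token ids
logits = model(tokens)                       # shape: (1, 1024, 20000)
```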
Sebastian Ruder is a Google Research Scientist who blogs about natural language processing and machine learning.
See his An overview of gradient descent optimization algorithms (2016):

> Gradient descent is the preferred way to optimize neural networks and many other machine learning algorithms but is often used as a black box. This post explores how many of the most popular gradient-based optimization algorithms such as Momentum, Adagrad, and Adam actually work.
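
As a companion to that post, here is a minimal NumPy sketch of the update rules it covers (plain gradient descent, Momentum, and Adam) on a toy quadratic objective; the hyperparameter values are common defaults, not anything prescribed by the post.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy objective f(theta) = 0.5 * ||theta||^2."""
    return theta

theta = np.array([3.0, -2.0])
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

# Plain gradient descent: theta <- theta - lr * grad(theta)
theta_gd = theta - lr * grad(theta)

# Momentum: accumulate a velocity vector and step along it.
v = np.zeros_like(theta)
v = beta1 * v + lr * grad(theta)
theta_momentum = theta - v

# Adam: first/second moment estimates with bias correction.
m, s = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)

print(theta)  # converges toward the minimum at the origin
```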